76 research outputs found
A model based approach to Spotify data analysis: a Beta GLMM
Digital music distribution is increasingly powered by automated mechanisms that continuously capture, sort and analyze large amounts of Web-based data. This paper deals with the management of songs' audio features from a statistical point of view. In particular, it explores the data-gathering mechanisms enabled by the Spotify Web API, and suggests statistical tools for the analysis of these data. Special attention is devoted to song popularity, and a Beta model including random effects is proposed in order to give a first answer to questions such as: what are the determinants of popularity? The identification of a model able to describe this relationship, and the determination of which characteristics matter most in making a song popular, is a very interesting topic for those who aim to predict the success of new products.
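The abstract describes a Beta regression with random effects for song popularity. As a hedged sketch only, on fully synthetic data with invented feature names, a linear mixed model on logit-transformed popularity can stand in for the Beta GLMM:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_artists, songs_per_artist = 30, 10
artist = np.repeat(np.arange(n_artists), songs_per_artist)

# hypothetical audio features of the kind returned by the Spotify Web API
danceability = rng.uniform(0, 1, artist.size)
energy = rng.uniform(0, 1, artist.size)
artist_effect = rng.normal(0, 0.5, n_artists)[artist]  # random effect per artist

# popularity rescaled to (0, 1), generated from a Beta distribution
mu = 1 / (1 + np.exp(-(-0.5 + 1.2 * danceability + 0.6 * energy + artist_effect)))
popularity = rng.beta(mu * 20, (1 - mu) * 20)

df = pd.DataFrame({"artist": artist, "danceability": danceability,
                   "energy": energy, "popularity": popularity})

# crude stand-in for the Beta GLMM: mixed model on the logit of popularity
df["logit_pop"] = np.log(df["popularity"] / (1 - df["popularity"]))
fit = smf.mixedlm("logit_pop ~ danceability + energy", df,
                  groups=df["artist"]).fit()
```

A genuine Beta GLMM (e.g. via R's glmmTMB) models the (0, 1) response directly with a Beta likelihood; the logit transform above only mimics its behaviour.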
Clustering alternatives in preference-approvals via novel pseudometrics
Preference-approval structures combine preference rankings and approval voting for
declaring opinions over a set of alternatives. In this paper, we propose a new procedure
for clustering alternatives in order to reduce the complexity of the preference-approval
space and provide a more accessible interpretation of data. To that end,
we present a new family of pseudometrics on the set of alternatives that take into
account voters' preferences via preference-approvals. To obtain clusters, we use the
Ranked k-medoids (RKM) partitioning algorithm, which takes as input the similarities
between pairs of alternatives based on the proposed pseudometrics. Finally,
using non-metric multidimensional scaling, clusters are represented in 2-dimensional
space.
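The pipeline described, dissimilarities between alternatives, a k-medoids partition, then non-metric MDS for display, can be sketched as follows; this uses a plain k-medoids loop rather than the Ranked k-medoids (RKM) variant, and a toy dissimilarity matrix in place of the paper's pseudometrics:

```python
import numpy as np
from sklearn.manifold import MDS

def k_medoids(D, k, n_iter=100, seed=0):
    """Plain k-medoids on a precomputed dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:  # medoid = member minimizing within-cluster dissimilarity
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# toy dissimilarities between 6 alternatives forming two obvious groups
D = np.array([[0, 1, 1, 9, 9, 9],
              [1, 0, 1, 9, 9, 9],
              [1, 1, 0, 9, 9, 9],
              [9, 9, 9, 0, 1, 1],
              [9, 9, 9, 1, 0, 1],
              [9, 9, 9, 1, 1, 0]], dtype=float)
medoids, labels = k_medoids(D, k=2)

# non-metric MDS embedding of the alternatives in 2-D, as in the paper
xy = MDS(n_components=2, metric=False, dissimilarity="precomputed",
         random_state=0).fit_transform(D)
```

With real data, D would be filled with the proposed pseudometric evaluated on the voters' preference-approvals.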
The Neutrophil-to-Lymphocyte Ratio is Related to Disease Activity in Relapsing Remitting Multiple Sclerosis
Background: The role of the neutrophil-to-lymphocyte ratio (NLR) of peripheral blood
has been investigated in relation to several autoimmune diseases. Limited studies have addressed
the significance of the NLR in terms of being a marker of disease activity in multiple sclerosis (MS).
Methods: This is a retrospective study of relapsing–remitting MS (RRMS) patients admitted to the
tertiary MS center of Catania, Italy during the period of 1 January to 31 December 2018. The aim of
the present study was to investigate the significance of the NLR in reflecting the disease activity in a
cohort of early diagnosed RRMS patients. Results: Among a total sample of 132 patients diagnosed
with RRMS, 84 were enrolled in the present study. In the association analysis, a relation between
the NLR value and disease activity at onset was found (V-Cramer 0.271, p = 0.013). In the logistic
regression model, the NLR variable (p = 0.03, Exp(B) 3.5, 95% CI 1.089–11.4) was related to disease
activity at onset. Conclusion: An elevated NLR is associated with disease activity at onset in RRMS
patients. More large-scale studies with a longer follow-up are needed.
Variable selection in mixed models: a graphical approach
Model selection can be defined as the task of estimating the performance of different
models in order to choose the (approximately) best one. The purpose of this article is to
introduce an extension of the graphical representation of deviance proposed in the framework
of classical and generalized linear models to the wider class of mixed models. The proposed
plot is useful in determining which explanatory variables are important, conditioning
on the random effects part. The applicability and the easy interpretation of the graph
are illustrated with a real data example.
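The deviance comparison behind such a graph can be sketched numerically; the paper's contribution is the graphical display itself, so the following only computes, on synthetic data, the deviances that would be plotted:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
g = np.repeat(np.arange(20), 10)      # 20 groups, 10 observations each
x1 = rng.normal(size=g.size)
x2 = rng.normal(size=g.size)          # irrelevant predictor
y = 1 + 2 * x1 + rng.normal(0, 1, 20)[g] + rng.normal(0, 1, g.size)
df = pd.DataFrame({"g": g, "x1": x1, "x2": x2, "y": y})

# ML (not REML) fits, so deviances of nested fixed-effects parts are comparable
deviances = {}
for f in ("y ~ 1", "y ~ x1", "y ~ x1 + x2"):
    m = smf.mixedlm(f, df, groups=df["g"]).fit(reml=False)
    deviances[f] = -2 * m.llf
```

Plotting these deviances against the candidate fixed-effects specifications gives a rough analogue of the proposed representation, conditional on the random effects part.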
Random forest analysis: a new approach for classification of Beta Thalassemia
In recent years, Thalassemia care providers started classifying patients as transfusion-
dependent-Thalassemia (TDT) or non-transfusion-dependent-Thalassemia (NTDT), owing to
the established role of transfusion therapy in defining the clinical complication profile,
although this classification was also based on expert opinion and is limited by its reliance
on patients' current transfusion status. Starting from a vast set of variables indicating
severity phenotype, we use both classification and clustering techniques to explore the
presence of two (TDT vs NTDT) or more clusters, in order to approach a new definition for
the classification of Beta-Thalassemia within the Thalassemia Syndromes (TS).
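The two-sided strategy described, supervised classification into TDT/NTDT plus an unsupervised check on the number of clusters, can be sketched on synthetic stand-in data (all names and sizes are invented):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic severity-phenotype variables; y plays the role of TDT vs NTDT
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# supervised side: random forest classification and variable importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top3 = np.argsort(rf.feature_importances_)[::-1][:3]  # most discriminating variables

# unsupervised side: does the phenotype space support 2 clusters, or more?
sil = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                     random_state=0).fit_predict(X))
       for k in (2, 3, 4)}
```

Comparing silhouette scores across candidate k is one simple way to probe whether the data favour the traditional two-class split or a finer partition.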
Weighted and unweighted distances based decision tree for ranking data
Preference data represent a particular type of ranking data (widely used
in sports, web search, social sciences), where a group of people gives their preferences
over a set of alternatives. Within this framework, distance-based decision
trees represent a non-parametric tool for identifying the profiles of subjects giving
a similar ranking. This paper aims at detecting, in the framework of (complete
and incomplete) ranking data, the impact of the differently structured weighted distances
for building decision trees. The traditional metrics between rankings don't
take into account the importance of swapping elements that are similar to each other (element
weights) or elements belonging to the top (or to the bottom) of an ordering
(position weights). By means of simulations, using weighted distances to build decision
trees, we will compute the impact of different weighting structures both on
splitting and on consensus ranking. The distances that will be used satisfy Kemeny's
axioms and, accordingly, a modified version of the rank correlation coefficient τx
of Emond and Mason will be proposed and used for assessing the trees'
goodness.
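The quantities involved, a position-weighted Kemeny-type distance and Emond and Mason's τx, can be written down compactly; the weighting scheme below is one simple illustrative choice, not necessarily the modified version studied in the paper:

```python
import numpy as np
from itertools import combinations

def weighted_kemeny(r1, r2, w=None):
    """Kemeny-type distance between two complete rankings given as rank vectors
    (1 = most preferred). Each pairwise disagreement is weighted by the position
    weight of the topmost position involved in r1; with unit weights this is the
    usual count of pairwise disagreements (illustrative scheme only)."""
    n = len(r1)
    if w is None:
        w = np.ones(n)
    d = 0.0
    for i, j in combinations(range(n), 2):
        if np.sign(r1[j] - r1[i]) != np.sign(r2[j] - r2[i]):
            d += w[min(r1[i], r1[j]) - 1]
    return d

def score_matrix(r):
    """Emond-Mason score matrix: +1 if item i is ranked at least as high as j."""
    r = np.asarray(r)
    a = np.where(r[:, None] <= r[None, :], 1, -1)
    np.fill_diagonal(a, 0)
    return a

def tau_x(r1, r2):
    """Emond and Mason's rank correlation coefficient tau_x."""
    n = len(r1)
    return (score_matrix(r1) * score_matrix(r2)).sum() / (n * (n - 1))
```

For rankings without ties τx coincides with Kendall's τ; its extension to tied rankings is what makes it compatible with Kemeny's axioms.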
GAMLSS for high-variability data: an application to liver fibrosis case.
In this paper, we propose to manage the problem caused by overdispersed data
by applying the generalized additive model for location, scale and shape
(GAMLSS) framework introduced by Rigby and Stasinopoulos (2005). The idea of
using a GAMLSS approach for our problem comes from Aitkin (1996), who used an EM maximum
likelihood estimation algorithm (Dempster, Laird, and Rubin, 1977) to deal
with overdispersed generalized linear models (GLM). As in the GLM case, the algorithm
is initially derived as a form of Gaussian quadrature assuming a normal
mixing distribution. The GAMLSS specification allows the extension of the Aitkin
algorithm to probability distributions not belonging to the exponential family. In
particular, the aim of this work is to show the importance of using a GAMLSS structure
when a mixture is used to provide a natural representation of heterogeneity in a finite
number of latent classes (Celeux and Diebolt, 1992).
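The core GAMLSS idea, modelling scale (and shape) parameters alongside the location, can be illustrated with a minimal Gaussian location-scale fit by maximum likelihood; real GAMLSS (e.g. the R gamlss package) additionally supports smooth terms and many non-exponential-family distributions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 400)
# heteroscedastic data: both the mean and the spread depend on x
y = 1.0 + 2.0 * x + rng.normal(0, np.exp(-0.5 + 1.5 * x))

def nll(theta):
    b0, b1, c0, c1 = theta
    mu = b0 + b1 * x               # location model
    sigma = np.exp(c0 + c1 * x)    # log link keeps the scale positive
    return -norm.logpdf(y, mu, sigma).sum()

fit = minimize(nll, x0=np.zeros(4), method="BFGS")
b0, b1, c0, c1 = fit.x
```

A plain GLM would force a single dispersion parameter here; letting log σ depend on x is the simplest instance of the location-scale-shape modelling the paper relies on.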
Classification trees for preference data: a distance-based approach
In the framework of preference rankings, when the interest lies in
explaining which predictors and which interactions among predictors are able
to explain the observed preference structures, the possibility to derive consensus
measures using a classification tree represents a novelty and an important tool
given its easy interpretability. In this work we propose the use of a multivariate
decision tree where a weighted Kemeny distance is used both to evaluate the
distances between rankings and to define an impurity measure to be used in the
recursive partitioning. The proposed approach also allows disagreements involving
top-ranked and bottom-ranked alternatives to be weighted differently.
Dealing with the Pseudo-Replication Problem in Longitudinal Data from Posidonia Oceanica Surveys: Modeling Dependence vs. Subsampling
Posidonia oceanica represents the key species of the most important ecosystem in subtidal habitats of the Mediterranean Sea. Being sensitive to changes in the environment, it is considered a crucial indicator of the quality of coastal marine waters.
A peculiarity of P. oceanica is the presence of reiterative modules characterizing its growth, which lend themselves to back-dating techniques, allowing for the reconstruction of past history of growth variables (annual rhizome elongation and diameter, primary production, etc.).
Such back-dating techniques provide, for each sampled shoot, a longitudinal series of multivariate data; this is an instance of what Hurlbert (1984), in a seminal paper, defined as "pseudoreplication", for which it becomes crucial to take into account the possible dependence of the data.
A common solution to the pseudoreplication problem in the ecological literature is sub-sampling: given repeated measurements on the same unit, only a random sub-sample of such measurements is analyzed, in order to attenuate correlation and obtain approximately independently distributed observations, to which standard statistical methods can be applied. In its most extreme version, only one measurement is randomly drawn for each unit, i.e. the sub-sampling size is one. While sub-sampling attenuates correlation, it also implies a loss of information (due to the reduction of the total sample size) and therefore requires a higher number of sampling units to ensure a specified level of efficiency and power.
In the talk, we contrast sub-sampling with the alternative approach of handling dependence directly at the modelling stage, using the class of Generalized Linear Mixed Models. We show that this approach permits remarkable gains in estimation precision and testing power, without requiring the increase in sample size involved in sub-sampling, and thus avoids the practice of over-sampling, which has a negative impact on aquatic ecosystems.
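The contrast described, sub-sampling versus modelling the dependence, can be sketched on synthetic shoot-level data; a linear mixed model stands in here for the broader GLMM class, and all numbers are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_shoots, n_years = 40, 15
shoot = np.repeat(np.arange(n_shoots), n_years)
year = np.tile(np.arange(n_years), n_shoots)

# back-dated growth series: shoot-level random effect + year trend + noise
u = rng.normal(0, 1.0, n_shoots)[shoot]
growth = 5 + 0.1 * year + u + rng.normal(0, 0.5, shoot.size)
df = pd.DataFrame({"shoot": shoot, "year": year, "growth": growth})

# modelling dependence: mixed model on the full longitudinal series
full = smf.mixedlm("growth ~ year", df, groups=df["shoot"]).fit()

# sub-sampling: one randomly drawn measurement per shoot, then plain OLS
sub = df.groupby("shoot").sample(n=1, random_state=3)
ols = smf.ols("growth ~ year", sub).fit()

# compare the precision of the estimated year effect under the two approaches
se_mixed, se_ols = full.bse["year"], ols.bse["year"]
```

The standard error from the mixed model on the full series is typically far smaller than the one from the one-per-unit subsample, which is exactly the efficiency argument against over-sampling made in the talk.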